Published as a conference paper at ICLR 2020  
NEVER GIVE UP: LEARNING DIRECTED  
EXPLORATION STRATEGIES  
Adrià Puigdomènech Badia Pablo Sprechmann Alex Vitvitskyi Daniel Guo  
Bilal Piot Steven Kapturowski Olivier Tieleman Martín Arjovsky  
Alexander Pritzel Andew Bolt Charles Blundell  
DeepMind {adriap, psprechmann, avlife, danielguo,  
piot, skapturowski, tieleman,  
apritzel, abolt, cblundell}@google.com  
ABSTRACT  
We propose a reinforcement learning agent to solve hard exploration games by  
learning a range of directed exploratory policies. We construct an episodic memory-  
based intrinsic reward using k-nearest neighbors over the agent’s recent experience  
to train the directed exploratory policies, thereby encouraging the agent to repeat-  
edly revisit all states in its environment. A self-supervised inverse dynamics model  
is used to train the embeddings of the nearest neighbour lookup, biasing the novelty  
signal towards what the agent can control. We employ the framework of Universal  
Value Function Approximators (UVFA) to simultaneously learn many directed  
exploration policies with the same neural network, with different trade-offs be-  
tween exploration and exploitation. By using the same neural network for different  
degrees of exploration/exploitation, transfer is demonstrated from predominantly  
exploratory policies yielding effective exploitative policies. The proposed method  
can be incorporated to run with modern distributed RL agents that collect large  
amounts of experience from many actors running in parallel on separate environ-  
ment instances. Our method doubles the performance of the base agent in all  
hard exploration in the Atari-57 suite while maintaining a very high score across  
the remaining games, obtaining a median human normalised score of 1344.0%.  
Notably, the proposed method is the first algorithm to achieve non-zero rewards  
(
with a mean score of 8,400) in the game of Pitfall! without using demonstrations  
or hand-crafted features.  
1
INTRODUCTION  
The problem of exploration remains one of the major challenges in deep reinforcement learning.  
In general, methods that guarantee finding an optimal policy require the number of visits to each  
state–action pair to approach infinity. Strategies that become greedy after a finite number of steps  
may never learn to act optimally; they may converge prematurely to suboptimal policies, and never  
gather the data they need to learn. Ensuring that all state-action pairs are encountered infinitely often  
is the general problem of maintaining exploration (François-Lavet et al., 2018; Sutton & Barto, 2018).  
The simplest approach for tackling this problem is to consider stochastic policies with a non-zero  
probability of selecting all actions in each state, e.g. -greedy or Boltzmann exploration. While these  
techniques will eventually learn the optimal policy in the tabular setting, they are very inefficient and  
the steps they require grow exponentially with the size of the state space. Despite these shortcomings,  
they can perform remarkably well in dense reward scenarios (Mnih et al., 2015). In sparse reward  
settings, however, they can completely fail to learn, as temporally-extended exploration (also called  
deep exploration) is crucial to even find the very few rewarding states (Osband et al., 2016).  
Equal contribution.  
1
Published as a conference paper at ICLR 2020  
Recent approaches have proposed to provide intrinsic rewards to agents to drive exploration, with a  
focus on demonstrating performance in non-tabular settings. These intrinsic rewards are proportional  
to some notion of saliency quantifying how different the current state is from those already visited  
(
Bellemare et al., 2016; Haber et al., 2018; Houthooft et al., 2016; Oh et al., 2015; Ostrovski et al.,  
2
017; Pathak et al., 2017; Stadie et al., 2015). As the agent explores the environment and becomes  
familiar with it, the exploration bonus disappears and learning is only driven by extrinsic rewards.  
This is a sensible idea as the goal is to maximise the expected sum of extrinsic rewards. While  
very good results have been achieved on some very hard exploration tasks, these algorithms face  
a fundamental limitation: after the novelty of a state has vanished, the agent is not encouraged to  
visit it again, regardless of the downstream learning opportunities it might allow (Bellemare et al.,  
2
016; Ecoffet et al., 2019; Stanton & Clune, 2018). Other methods estimate predictive forward  
models (Haber et al., 2018; Houthooft et al., 2016; Oh et al., 2015; Pathak et al., 2017; Stadie  
et al., 2015) and use the prediction error as the intrinsic motivation. Explicitly building models like  
this, particularly from observations, is expensive, error prone, and can be difficult to generalize to  
arbitrary environments. In the absence of the novelty signal, these algorithms reduce to undirected  
exploration schemes, maintaining exploration in a non-scalable way. To overcome this problem, a  
careful calibration between the speed of the learning algorithm and that of the vanishing rewards is  
required (Ecoffet et al., 2019; Ostrovski et al., 2017).  
The main idea of our proposed approach is to jointly learn separate exploration and exploitation  
policies derived from the same network, in such a way that the exploitative policy can concentrate on  
maximising the extrinsic reward (solving the task at hand) while the exploratory ones can maintain  
exploration without eventually reducing to an undirected policy. We propose to jointly learn a family  
of policies, parametrised using the UVFA framework (Schaul et al., 2015a), with various degrees  
of exploratory behaviour. The learning of the exploratory policies can be thought of as a set of  
auxiliary tasks that can help build a shared architecture that continues to develop even in the absence  
of extrinsic rewards (Jaderberg et al., 2016). We use reinforcement learning to approximate the  
optimal value function corresponding to several different weightings of intrinsic rewards.  
We propose an intrinsic reward that combines per-episode and life-long novelty to explicitly encourage  
the agent to repeatedly visit all controllable states in the environment over an episode. Episodic  
novelty encourages an agent to periodically revisit familiar (but potentially not fully explored) states  
over several episodes, but not within the same episode. Life-long novelty gradually down-modulates  
states that become progressively more familiar across many episodes. Our episodic novelty uses an  
episodic memory filled with all previously visited states, encoded using the self-supervised objective  
of Pathak et al. (2017) to avoid uncontrollable parts of the state space. Episodic novelty is then defined  
as similarity of the current state to previously stored states. This allows the episodic novelty to rapidly  
adapt within an episode: every observation made by the agent potentially changes the per-episode  
novelty significantly. Our life-long novelty multiplicatively modulates the episodic similarity signal  
and is driven by a Random Network Distillation error (Burda et al., 2018b). In contrast to the episodic  
novelty, the life-long novelty changes slowly, relying upon gradient descent optimisation (as opposed  
to an episodic memory write for episodic novelty). Thus, this combined notion of novelty is able to  
generalize in complex tasks with large, high dimensional state spaces in which a given state is never  
observed twice, and maintain consistent exploration both within an episode and across episodes.  
This paper makes the following contributions: (i) defining an exploration bonus combining life-long  
and episodic novelty to learn exploratory strategies that can maintain exploration throughout the  
agent’s training process (to never give up), (ii) to learn a family of policies that separate exploration  
and exploitation using a conditional architecture with shared weights, (iii) experimental evidence that  
the proposed method is scalable and performs on par or better than state-of-the-art methods on hard  
exploration tasks. Our work differs from Savinov et al. (2018) in that it is not specialised to navigation  
tasks, our method incorporates a long-term intrinsic reward and is able to separate exploration and  
exploitation policies. Unlike Stanton & Clune (2018), our work relies on no privileged information  
and combines both episodic and non-episodic novelty, obtaining superior results. Our work differs  
from Beyer et al. (2019) in that we learn multiple policies by sharing weights, rather than just a  
common replay buffer, and our method does not require exact counts and so can scale to more realistic  
domains such as Atari. The paper is organized as follows. In Section 2 we describe the proposed  
intrinsic reward. In Section 3, we describe the proposed agent and general framework. In Section 4  
we present experimental evaluation.  
2
Published as a conference paper at ICLR 2020  
RND random network  
life-long novelty  
module  
multiplicative  
modulation  
classifier  
RND prediction network  
embedding  
network  
k-nearest  
neighbors  
controllable state  
episodic novelty  
module  
embedding network  
episodic memory  
Figure 1: (left) Training architecture for the embedding network (right) NGU’s reward generator.  
2
THE NEVER-GIVE-UP INTRINSIC REWARD  
We follow the literature on curiosity-driven exploration, where the extrinsic reward is augmented  
with an intrinsic reward (or exploration bonus). The augmented reward at time  
t
is then defined as  
is a positive  
e
i
e
i
and r are respectively the extrinsic and intrinsic rewards, and  
t
rt = r + βr , where  
r
β
t
t
t
scalar weighting the relevance of the latter. Deep RL agents are typically trained on the augmented  
e
t
reward rt, while performance is measured on extrinsic reward  
r
only. This section describes the  
i
proposed intrinsic reward r .  
t
i
t
Our intrinsic reward  
r
satisfies three properties: (i) it rapidly discourages revisiting the same state  
within the same episode, (ii) it slowly discourages visits to states visited many times across episodes,  
iii) the notion of state ignores aspects of an environment that are not influenced by an agent’s actions.  
(
We begin by providing a general overview of the computation of the proposed intrinsic reward. Then  
we provide the details of each one of the components. The reward is composed of two blocks: an  
episodic novelty module and an (optional) life-long novelty module, represented in red and green  
respectively in Fig. 1 (right). The episodic novelty module computes our episodic intrinsic reward  
and is composed of an episodic memory,  
M, and an embedding function f, mapping the current  
observation to a learned representation that we refer to as controllable state. At the beginning of each  
episode, the episodic memory starts completely empty. At every step, the agent computes an episodic  
episodic  
, and appends the controllable state corresponding to the current observation  
to the memory . To determine the bonus, the current observation is compared to the content of the  
episodic memory. Larger differences produce larger episodic intrinsic rewards. The episodic intrinsic  
intrinsic reward,  
r
M
t
episodic  
promotes the agent to visit as many different states as possible within a single episode.  
This means that the notion of novelty ignores inter-episode interactions: a state that has been visited  
thousands of times gives the same intrinsic reward as a completely new state as long as they are  
equally novel given the history of the current episode.  
reward  
r
t
A life-long (or inter-episodic) novelty module provides a long-term novelty signal to statefully control  
the amount of exploration across episodes. We do so by multiplicatively modulating the exploration  
episodic  
with a life-long curiosity factor, αt. Note that this modulation will vanish over time,  
episodic  
bonus  
r
t
reducing our method to using the non-modulated reward. Specifically, we combine αt with  
r
t
as follows (see also Fig. 1 (right)):  
i
episodic  
t
r = r  
· min {max {αt, 1} , L}  
(1)  
t
where  
L is a chosen maximum reward scaling (we fix L = 5 for all our experiments). Mixing rewards  
i
this way, we leverage the long-term novelty detection that αt offers, while r continues to encourage  
t
our agent to explore all the controllable states.  
p
Embedding network: f : O → R maps the current observation to a  
p
-dimensional vector corre-  
sponding to its controllable state. Consider an environment that has a lot of variability independent of  
the agent’s actions, such as navigating a busy city with many pedestrians and vehicles. An agent could  
3
Published as a conference paper at ICLR 2020  
visit a large number of different states (collecting large cumulative intrinsic rewards) without taking  
any actions. This would not lead to performing any meaningful form of exploration. To avoid such  
meaningless exploration, given two consecutive observations, we train a Siamese network (Bromley  
et al., 1994; Koch et al., 2015)  
f to predict the action taken by the agent to go from one observation  
to the next (Pathak et al., 2017). Intuitively, all the variability in the environment that is not affected  
by the action taken by the agent would not be useful to make this prediction. More formally, given a  
triplet {xt, at, xt+1} composed of two consecutive observations, xt and xt+1, and the action taken  
by the agent at, we parameterise the conditional likelihood as p(a|xt, xt+1) = h(f(xt), f(xt+1)),  
where  
h is a one hidden layer MLP followed by a softmax. The parameters of both h and f are  
trained via maximum likelihood. This architecture can be thought of as a Siamese network with a  
one-layer classifier on top, see Fig. 1 (left) for an illustration. For more details about the architecture,  
see App. H.1, and hyperparameters, see App. F.  
Episodic memory and intrinsic reward: The episodic memory  
based memory that stores the controllable states in an online fashion (Pritzel et al., 2017). At time  
, the memory contains the controllable states of all the observations visited in the current episode,  
f(x0), f(x1), . . . , f(xt1)}. Inspired by theoretically-justified exploration methods turning state-  
action counts into a bonus reward (Strehl & Littman, 2008), we define our intrinsic reward as  
M is a dynamically-sized slot-  
t
{
1
1
episodic  
r
= p  
q  
(2)  
t
P
n(f(xt))  
K(f(xt), fi) + c  
fiNk  
where n(f(xt)) is the counts for the visits to the abstract state f(xt). We approximate these counts  
p
p
n(f(xt)) as the sum of the similarities given by a kernel function K : R ×R  R, over the content  
of  
M
(
M
. In practice, pseudo-counts are computed using the  
k
-nearest neighbors of f(xt) in the memory  
, denoted by Nk = {fi} . The constant c guarantees a minimum amount of “pseudo-counts”  
is a Dirac delta function, the approximation  
k
i=1  
fixed to 0.001 in all our experiments). Note that when K  
becomes exact but consequently provides no generalisation of exploration required for very large  
state spaces. Following Blundell et al. (2016); Pritzel et al. (2017), we use the inverse kernel for K,  
d (x,y)  
K(x, y) =  
(3)  
2
2
m
+ ꢀ  
d
3
2
d
where  
is a small constant (fixed to 10 in all our experiments),  
d
is the Euclidean distance and  
m
is a running average of the squared Euclidean distance of the  
k
-th nearest neighbors. This running  
average is used to make the kernel more robust to the task being solved, as different tasks may have  
different typical distances between learnt embeddings. A detailed computation of the episodic reward  
can be found in Alg. 1 in App. A.1.  
Integrating life-long curiosity: In principle, any long-term novelty estimator could be used as a  
basis for the modulator αt. We found Random Network Distillation (Burda et al., 2018b, RND)  
worked well, is simple to implement and easy to parallelize. The RND modulator αt is defined by  
k
introducing a random, untrained convolutional network g : O → R , and training a predictor network  
k
gˆ : O → R that attempts to predict the outputs of  
g
on all the observations that are seen during  
2
training by minimizing err(xt) = || gˆ (xt; θ)  g(xt)|| with respect to the parameters of gˆ  
,
θ
. We  
then define the modulator αt as a normalized mean squared error, as done in Burda et al. (2018b):  
err(xt)µe  
αt = 1 +  
, where σe and µe are running standard deviation and mean for err(xt). For  
σe  
more details about the architecture, see App. H.2, and hyperparameters, see App. F.  
3
THE NEVER-GIVE-UP AGENT  
In the previous section we described an episodic intrinsic reward for learning policies capable of  
maintaining exploration in a meaningful way throughout the agent’s training process. We now  
demonstrate how to incorporate this intrinsic reward into a full agent that maintains a collection of  
value functions, each with a different exploration-exploitation trade-off.  
Using intrinsic rewards as a means of exploration subtly changes the underlying Markov Decision  
e
t
i
t
Process (MDP) being solved: if the augmented reward rt = r + βr varies in ways unpredictable  
from the action and states, then the decision process may no longer be a MDP, but instead be a  
Partially Observed MDP (POMDP). Solving PODMPs can be much harder than solving MDPs, so to  
4
Published as a conference paper at ICLR 2020  
avoid this complexity we take two approaches: firstly, the intrinsic reward is fed directly as an input  
to the agent, and secondly, our agent maintains an internal state representation that summarises its  
history of all inputs (state, action and rewards) within an episode. As the basis of our agent, we use  
Recurrent Replay Distributed DQN (Kapturowski et al., 2019, R2D2) as it combines a recurrent state,  
experience replay, off-policy value learning and distributed training, matching our desiderata.  
Unlike most of the previously proposed intrinsic rewards (as seen in Section 1), the never-give-up  
intrinsic reward does not vanish over time, and thus the learned policy will always be partially driven  
by it. Furthermore, the proposed exploratory behaviour is directly encoded in the value function and  
as such it cannot be easily turned off. To overcome this problem, we proposed to jointly learn an  
explicit exploitative policy that is only driven by the extrinsic reward of the task at hand.  
Proposed architecture: We propose to use a universal value function approximator (UVFA)  
Q(x, a, βi) to simultaneously approximate the optimal value function with respect to a family of  
βi  
e
t
i
t
N1  
N
of values {βi}  
i=0  
augmented rewards of the form rt = r + βir . We employ a discrete number  
including the special case of β0 = 0 and βN1 = β where β is the maximum value chosen. In this  
way, one can turn-off exploratory behaviour simply by acting greedily with respect to Q(x, a, 0).  
Even before observing any extrinsic reward, we are able to learn a powerful representation and set  
of skills that can be quickly transferred to the exploitative policy. In principle, one could think of  
having an architecture with only two policies, one with β0 = 0 and one with β1 > 0. The advantage  
of learning a larger number of policies comes from the fact that exploitative and exploratory policies  
could be quite different from a behaviour standpoint. Having a larger number of policies that change  
smoothly allows for more efficient training. For a detailed description of the specific values of βi  
we use in our experiments, see App.A. We adapt the R2D2 agent that uses the dueling network  
architecture of Wang et al. (2015) with an LSTM layer after a convolutional neural network. We  
concatenate to the output of the network a one-hot vector encoding the value of βi, the previous  
i
e
. We describe the  
action at1, the previous intrinsic reward  
r
and the previous extrinsic reward  
r
t
t
precise architecture in App. H.3 and hyperparameters in App. F.  
RL Loss functions: As a training loss we use a transformed Retrace double Q-learning loss. In  
App. E we provide the details of the computation of the Retrace loss (Munos et al., 2016). In addition,  
we associate for each βi  
a γi, with γ0 = 0.997, and γN1 = 0.99. We remark that the exploitative  
policy β0 is associated with the highest discount factor γ0 = γmax and the most exploratory policy  
βN1 with the smallest discount factor γ0 = γmin. We can use smaller discount factors for the  
exploratory policies because the intrinsic reward is dense and the range of values is small, whereas  
we would like the highest possible discount factor for the exploitative policy in order to be as close as  
possible from optimizing the undiscounted return. For a detailed description of the specific values of  
γi we use in our experiments, see App. A.  
Distributed training: Recent works in deep RL have achieved significantly improved performance  
by running on distributed training architectures that collect large amounts of experience from many  
actors running in parallel on separate environment instances (Andrychowicz et al., 2018; Barth-Maron  
et al., 2018; Burda et al., 2018b; Espeholt et al., 2018; Horgan et al., 2018; Kapturowski et al., 2019;  
Silver et al., 2016). Our agent builds upon the work by Kapturowski et al. (2019) to decouple learning  
from acting, with actors (256 unless stated otherwise) feeding experience into a distributed replay  
buffer and the learner training on randomly sampled batches from it in a prioritized way (Schaul  
et al., 2015b). Please refer to App. A for details.  
4
EXPERIMENTS  
We begin by analyzing the exploratory policy of the Never Give Up (NGU) agent with a single  
reward mixture. We perform such analysis by using a minimal example environment in Section 4.1.  
We observe the performance of its learned policy, as well as highlight the importance of learning a  
representation for abstract states. In Section 4.2, we analyze the performance of the full NGU agent,  
evaluating its effectiveness on the Arcade Learning Environment (ALE; Bellemare et al. (2013)). We  
measure the performance of the agent against baselines on hard exploration games, as well as dense  
reward games. We expand on the analysis of the NGU agent by running it on the full set of Atari  
games, as well as showing multiple ablations on important choices of hyperparameters of the model.  
5
Published as a conference paper at ICLR 2020  
Figure 2: (Left and Center) Sample screens of Random Disco Maze. The agent is in green, and  
pathways in black. The colors of the wall change at every time step. (Right) Learning curves for  
Random projections vs. learned controllable states and a baseline RND implementation.  
4
.1 CONTROLLED SETTING ANALYSIS  
In this section we present a simple example to highlight the effectiveness of the exploratory policy of  
the NGU agent, as well as the importance of estimating the exploration bonus using a controllable  
state representation. To isolate the effect of the exploratory policy, we restrict the analysis to the  
case of a single exploratory policy (N = 1, with β = 0.3). We introduce a gridworld environment,  
Random Disco Maze, implemented with the pycolab game engine (Stepleton, 2017), depicted in Fig.  
2
(left). At each episode, the agent finds itself in a new randomly generated maze of size 21x21.  
The agent can take four actions left, right, up, down , moving a single position at a time. The  
{
}
environment is fully observable. If the agent steps into a wall, the episode ends and a new maze is  
generated. Crucially, at every time step, the color of each wall fragment is randomly sampled from  
a set of five possible colors, enormously increasing the number of possible states. This irrelevant  
variability in color presents a serious challenge to algorithms using exploration bonuses based on  
novelty, as the agent is likely to never see the same state twice. This experiment is purely exploratory,  
with no external reward. The goal is to see if the proposed model can learn a meaningful directed  
exploration policy despite the large visual distractions providing a continual stream of observation  
novelty to the agent. Fig. 2 shows the percentage of unique states (different positions in the maze)  
visited by agents trained with the proposed model and one in which the mapping f is a fixed random  
projection (i.e. is untrained). The proposed model learns to explore any maze sampled from the  
f
task-distribution. The agent learns a strategy that resembles depth-first search: it explores as far as  
possible along each branch before backtracking (often requiring backtracking a few dozen steps to  
reach an unexplored area). The model with random projections, as well as our baseline of RND, do  
1
not show such exploratory behaviour . Both models do learn to avoid walking into walls, doing so  
would limit the amount of intrinsic reward it would receive. However, staying alive is enough: simply  
oscillating between two states will produce different (and novel) controllable states at every time step.  
4
.2 ATARI RESULTS  
In this section, we evaluate the effectiveness of the NGU agent on the Arcade Learning Environment  
ALE; (Bellemare et al., 2013)). We use standard Atari evaluation protocol and pre-processing as  
(
described in Tab. 8 of App. F.4, with the only difference being that we do not use frame stacking. We  
restrict NGU to using the same setting and data consumption as R2D2, the best performing algorithm  
on Atari (Kapturowski et al., 2019). While we compare our results with the best published methods  
on this benchmark, we note that different baselines use very different training regimes with very  
different computational budgets. Comparing distributed and non-distributed methods is in general  
difficult. In an effort to properly assess the merits of the proposed model we include two additional  
baselines: as NGU is based on R2D2 using the Retrace loss (instead of its n-step objective) we  
include this as a baseline, and since we use RND as a reward modulator, we also include R2D2 with  
Retrace using the RND intrinsic reward. These methods are all run for 35 billion frames using the  
same protocol as that of R2D2 (Kapturowski et al., 2019). We detail the use of compute resources of  
the algorithms in App. D. We report the return averaged over 3 different seeds.  
Architecture: We adopt the same core architecture as that used by the R2D2 agent to facilitate  
comparisons. There are still a few choices to make, namely: the size of the learned controllable states,  
1See video of the trained agent here: https://youtu.be/9HTY4ruPrHw  
6
Published as a conference paper at ICLR 2020  
Figure 3: Human Normalized Scores on dense reward and hard exploration games.  
the clipping factor  
L in (1), and the number of nearest neighbours to use for computing pseudo-counts  
in (2). We selected these hyperparameters by analysing the performance of the single policy agent,  
NGU(N = 1), on two representative exploration games: Montezuma’s Revenge and Pitfall!. We  
report this study in App. B. We used the same fixed set of hyperparameters in all the remaining  
experiments.  
NGU agent: We performed further ablations in order to better understand several major design  
choices of the full NGU agent on a set of 8 Atari games: the set of 5 dense reward games chosen to  
select the hyperparameters of Mnih et al. (2015), as well as 3 hard exploration games (Montezuma’s  
Revenge, Pitfall!, and Private Eye). For a detailed description of the results on these games as well  
as results on more choices of hyperparameters, please see App.C. The ablations we perform are on  
the number of mixtures  
maximum magnitude of  
N
β
, the impact of the off-policy data used (referred to as CMR below), the  
(by default 0.3 if not explicitly mentioned), the use of RND to scale the  
intrinsic reward, and the performance of the agent in absence of extrinsic rewards. We denote by  
Cross Mixture Ratio (CMR) the proportion in the training batches of experience collected using  
different values of βi from the one being trained. A CMR of  
data produced by the same βi, while a CMR of 0.5 means using equal amounts of data produced by  
βi and βj=i. Our base agent NGU has a CMR of 0.  
0 means training each policy only with  
6
The results are shown in Fig. 3. Several conclusions can be extracted from these results: Firstly,  
sharing experience from all the actors (with CMR of 0.5) slightly harms overall average performance  
on hard exploration games. This suggests that the power of acting differently for different condi-  
tioning mixtures is mostly acquired through the shared weights of the model rather than shared  
data. Secondly, we observe an improvement, on average, from increasing the number of mixtures  
N
on hard exploration games. Thirdly, as one can observe in analyzing the value of β, the value  
of β = 0.3 is the best average performing value, whereas β = 0.2 and β = 0.5 make the average  
performance worse on those hard exploration games. These values indicate, in this case, the limits in  
which NGU is either not having highly enough exploratory variants (  
β
too low) or policies become  
too biased towards exploratory behavior (  
β too high). Further, the use of the RND factor seems to  
be greatly beneficial on these hard exploration games. This matches the great success of existing  
literature, in which long-term intrinsic rewards appear to have a great impact (Bellemare et al., 2016;  
Ostrovski et al., 2017; Choi et al., 2018). Additionally, as outlined above, the motivation behind  
studying these variations on this set of 8 games is that those hyperparameters are of general effect,  
rather than specific to exploration. However, surprisingly, with the exception of the case of removing  
the extrinsic reward, they seem to have little effect on the dense reward games we analyze (with all  
error bars overlapping). This suggests that NGU and its hyperparameters are relatively robust: as  
extrinsic rewards become dense, intrinsic rewards (and their relative weight to the extrinsic rewards)  
e
naturally become less relevant. Finally, even without extrinsic reward  
r
, we can still obtain average  
superhuman performance on the 5 dense reward games we evaluate, indicating that the exploration  
policy of NGU is an adequate high performing prior for this set of tasks. That confirms the findings  
of Burda et al. (2018a), where they showed that there is a high degree of alignment between the  
intrinsic curiosity objective and the hand-designed extrinsic rewards of many game environments. The  
heuristics of surviving and exploring what is controllable seem to be highly general and beneficial, as  
we have seen in the Disco Maze environment in Section 4.1, as well as on Atari.  
7
Published as a conference paper at ICLR 2020  
Algorithm  
Human  
Best baseline  
RND  
R2D2+RND  
R2D2(Retrace)  
NGU(N=1)-RND  
NGU(N=1)  
NGU(N=32)  
Gravitar  
3.4k  
15.7k  
MR  
4.8k  
11.6k  
10.1k  
Pitfall!  
6.5k  
0.0  
-3  
PrivateEye  
69.6k  
11k  
8.7k  
Solaris  
12.3k  
5.5k  
Venture  
1.2k  
2.0k  
1.9k  
3.9k  
3.3k  
15.6k  
13.3k  
12.4k  
11.0k  
14.1k  
±
±
±
±
±
0.6k 10.4k  
±
1.2k -0.5±0.3  
19.5k±3.5k  
4.3k  
6.0k  
5.7k  
±
±
±
±
±
0.6k 2.7k±0.0k  
0.6k 2.3k±0.4k -3.5±1.2  
32.5k±4.7k  
1.1k 2.0k  
±
±
0.0k  
37.9  
0.8k 3.0k±0.0k 15.2k±9.4k 40.6k±0.0k  
1.8k 46.4  
1.6k 876.3  
0.7k 8.7k±1.2k 9.4k±2.2k 60.6k  
±
16.3k 5.9k  
±
114.5  
0.5k 10.4k  
±
1.6k 8.4k±4.5k 100.0k±0.4k 4.9k  
0.3k 1.7k  
±
0.1k  
Table 1: Results against exploration algorithm baselines. Best baseline takes the best result among  
R2D2 (Kapturowski et al., 2019), DQN + PixelCNN (Ostrovski et al., 2017), DQN + CTS (Bellemare  
et al., 2016), RND (Burda et al., 2018b), and PPO + CoEx (Choi et al., 2018) for each game.  
Figure 4: Human Normalized Scores on the 6 hard exploration games.  
Hard exploration games: We now evaluate the full NGU agent on the six hard exploration games  
identified by Bellemare et al. (2016). We summarise the results on Tab. 1. The proposed method  
achieves on similar or higher average return than state-of-the-art baselines on all hard exploration  
tasks. Remarkably, to the best of our knowledge, this is the first method without use of privileged  
information that obtains a positive score on Pitfall!, with NGU(N = 1)-RND obtaining a best score  
of 15,200. Moreover, in 4 of the 6 games, NGU(N = 32) appears to substantially improve against  
the single mixture case NGU(N = 1). This shows how the exploitative policy is able to leverage the  
shared weights with all the intrinsically-conditioned mixtures to explore games in which it is hard  
to do so, but still optimize towards maximizing the final episode score. In Fig. 4 we can see these  
conclusions more clearly: both in terms of mean and median human normalized scores, NGU greatly  
improves upon existing algorithms.  
While direct comparison of the scores is interesting, the emphasis of this work is on learning directed  
exploration strategies that encourage the agent to cover as much of the environment as possible. In  
Fig. 4.2 (left) we observe the average episodic return of NGU run with and without RND on Pitfall!.  
NGU(N = 32) is able to learn a directed exploration policy capable of exploring an average of 46  
rooms per episode, crossing 14 rooms before receiving the first extrinsic reward. We also observe that,  
in this case, using RND makes our model be less data efficient. This is also the case for NGU(N = 1),  
as observed on NGU(N = 1)-RND in Tab. 1, the best performing Pitfall! agent. We conjecture three  
main hypotheses to explain this: firstly, on Pitfall! (and unlike Montezuma’s Revenge) rooms are  
frequently aliased to one another, thus the agent does not obtain a large reward for discovering new  
rooms. This phenomenon would explain the results seen in Fig. 4.2 (right), in which RND greatly  
improves the results of NGU(N = 32). Secondly, the presence of a timer in the observation acts as  
a spurious source of novelty which greatly increases the number of unique states achievable even  
within a single room. Thirdly, as analyzed in Section 3.7 of Burda et al. (2018b), RND-trained agents  
often keep ’interacting with danger’ instead of exploring further, and Pitfall! is a game in which this  
can be highly detrimental, due to the high amount of dangerous elements in each room. Finally, we  
observe that NGU(N = 1) obtains better results than NGU(N = 32). Our intuition is that, in this  
case, a single policy should be simpler to learn and can achieve quite good results on this task, since  
exploration and exploitation policies are greatly similar.  
Dense reward games: Tab. 2 shows the results of our method on dense reward games. NGU(N = 1  
)
underperforms relative to R2D2 on most games (indeed the same can be said of R2D2(Retrace) that  
serves as the basis of NGU). Since the intrinsic reward signal may be completely misaligned with the  
goal of the game, these results may be expected. However, there are cases such as Pong, in which  
NGU(N = 1) catastrophically fails to learn to perform well. Here is where NGU(N = 32) solves  
8
Published as a conference paper at ICLR 2020  
Figure 5: Mean episodic return for agents trained (left) Pitfall! and (right) Montezuma’s Revenge.  
Algorithm  
Human  
Pong  
14.6  
QBert  
13.4k  
Breakout  
30.5  
Space Invaders  
1.6k  
Beam Rider  
16.9k  
R2D2  
21.0  
408.8k  
837.7  
43.2k  
188.2k  
R2D2+RND  
R2D2(Retrace)  
NGU(N=1)-RND  
NGU(N=1)  
NGU(N=32)  
20.7±0.0  
20.9±0.0  
-8.1±1.7  
-9.4±2.6  
19.6±0.1  
353.5k  
415.6k  
647.1k  
684.7k±8.8k  
±
41.0k 815.8±5.3  
55.8k 838.3±7.0  
50.5k 864.0±0.0  
54.5k±2.8k  
35.0k±13.0k  
45.3k±4.9k  
43.0k±3.9k  
44.6k±1.2k  
85.7k  
±
9.0k  
±
±
111.1k  
166.5k  
114.6k  
±
5.0k  
8.6k  
2.3k  
±
±
864.0±0.0  
465.8k  
±
84.9k 532.8±16.5  
68.7k±  
11.1k  
Table 2: Results against baselines on dense reward games.  
this issue: the exploitative policy learned by the agent is able to reliably learn to play the game.  
Nevertheless, NGU(N = 32) has limitations: even though its learned policies are vastly superhuman  
and empirically reasonable, they do not match R2D2 on Breakout and Beam Rider. This suggests  
that the representations learned by using the intrinsic signal still slightly interfere with the learning  
process of the exploitative mixture. We hypothesize that alleviating this further by having non-shared  
representations between mixtures should help in solving this issue.  
Results on all Atari 57 games: The proposed method achieves an overall median score of 1354.4%,  
compared to 95% for Nature DQN baseline, 191.8% for IMPALA, 1920.6% for R2D2, and 1451.8%  
for R2D2 using retrace loss. Please refer to App. G for separate results on individual games. Even  
though its overall median score is lower than R2D2, NGU maintains good performance on all games,  
performing above human level on 51 out of the 57 games. This also shows further confirmation that  
the learned exploitative mixture is still able to focus on maximizing the score of the game, making  
the algorithm able to obtain great performance across all games.  
Analysis of Multiple Mixtures: in Fig. 6, we can see NGU(N = 32) evaluated with β0 = 0 (used  
in all reported numerical results) against NGU(N = 32) evaluated with β31 = 0.3. We can observe  
different trends in the games: on Q*Bert the policies of the agent seem to converge to the exploitative  
policy regardless of the  
β condition, with its learning curve being almost identical to the one shown  
for R2D2 in Kapturowski et al. (2019). As seen in App. G, this is common in many games. The  
second most common occurrence is what we see on Pitfall! and Beam Rider, in which the policies  
quantitatively learn very different behaviour. In these cases, the exploitative learns to focus on its  
objective, and sometimes it does so by benefiting from the learnings of the exploratory policy, as it is  
2
the case in Pitfall! , where R2D2 never achieves a positive score. Finally, there is the exceptional case  
of Montezuma’s Revenge, in which the reverse happens: the exploratory policy obtains better score  
than the exploitative policy. In this case, extremely long-term credit assignment is required in order  
for the exploitative policy to consolidate the knowledge of the exploratory policy. This is because, to  
achieve scores that are higher than 16k, the agent needs to go to the second level of the game, going  
through many non-greedy and sometimes irreversible actions. For a more detailed analysis of this  
specific problem, see App. I.2.  
5
CONCLUSIONS  
We present a reinforcement learning agent that can effectively learn on both sparse and dense reward  
scenarios. The proposed agent achieves high scores in all Atari hard-exploration games, while still  
maintaining a very high average score over the whole Atari-57 suite. Remarkably, it is, to the best of  
our knowledge, the first algorithm to achieve non-zero rewards on the challenging game of Pitfall!  
without relying on human demonstrations, hand-crafted features, or manipulating the state of the  
environment. A central contribution of this work is a method for learning policies that can maintain  
2See videos of NGU on Pitfall with β  
0
9
Published as a conference paper at ICLR 2020  
Figure 6: NGU(N=32) behavior for β0 (blue) and β31 (orange).  
exploration throughout the training process. In the absence of extrinsic rewards, the method produces  
a policy that aims at traversing all controllable states of the MDP in a depth-first manner. We highlight  
that this could have impact beyond this specific application and/or algorithmic choices. For instance,  
one could use it as a behaviour policy to facilitate learning models of the environment or as a prior  
for planning methods.  
The proposed method is able to leverage large amounts of compute by running on distributed training  
architectures that collect large amounts of experience from many actors running in parallel on separate  
environment instances. This has been crucial for solving most challenging tasks in deep RL in recent  
years (Andrychowicz et al., 2018; Espeholt et al., 2018; Silver et al., 2016), and this method is able  
to utilize such compute to obtain strong performance on the set of hard-exploration games on Atari.  
While this is certainly a desirable feature and allows NGU to achieve a remarkable performance, it  
comes at the price of high sample complexity, consuming a large amount of simulated experience  
taking several days of wall-clock time. An interesting avenue for future research lies in improving  
the data efficiency of these agents.  
Further, the episodic novelty measure relies on the notion of controllable states to drive exploration.  
As observed on the Atari hard-exploration games, this strategy performs well on several tasks, but  
it may not be the right signal for some environments. For instance, in some environments it might  
take more than two consecutive steps to see the consequences of the actions taken by the agent. An  
interesting line for future research is learning effective controllable states beyond a simple inverse  
dynamics model.  
Additionally, the proposed work relies on the assumption that while different, one can find good  
exploratory and exploitative policies that are similar enough to be effectively represented using a  
shared parameterization (implemented using the UVFA framework). This can be limiting when the  
two policies are almost adversarial. This can be seen in games such as ‘Surround’ and ‘Ice hockey’.  
Finally, the hyperparameter  
significantly different extrinsic reward scales, might require different values of  
avenue forward is the dynamic adaptation of , which could be done by using techniques such as  
β
depends on the scale of the extrinsic reward. Thus, environments with  
β
. An interesting  
β
Population Based Training (PBT)(Jaderberg et al., 2017) or Meta-gradients(Xu et al., 2018). Another  
advantage of dynamically tuning this hyperparameter would be to allow for the model to become  
completely exploitative when the agent has reached a point in which further exploring does not lead  
to improvements on the exploitative policy. This is not trivially achievable however, as including  
such a mechanism would require calibrating the adaptation to be aligned to the speed of learning of  
the exploitative policy.  
ACKNOWLEDGMENTS  
We thank Daan Wierstra, Steph Hughes-Fitt, Andrea Banino, Meire Fortunato, Melissa Tan, Benigno  
Uria, Borja Ibarz, Mohammad Gheshlaghi Azar, Remi Munos, Bernardo Avila Pires, Andre Barreto,  
Vali Irimia, Sam Ritter, David Raposo, Tom Schaul and many other colleagues at DeepMind for  
helpful discussions and comments on the manuscript.  
10  
Published as a conference paper at ICLR 2020  
REFERENCES  
Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub  
Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous  
in-hand manipulation. arXiv preprint arXiv:1808.00177, 2018.  
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Alistair  
Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy  
gradients. arXiv preprint arXiv:1804.08617, 2018.  
Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos.  
Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information  
Processing Systems, pp. 1471–1479, 2016.  
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environ-  
ment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:  
2
53–279, 2013.  
Lucas Beyer, Damien Vincent, Olivier Teboul, Sylvain Gelly, Matthieu Geist, and Olivier Pietquin.  
Mulex: Disentangling exploitation from exploration in deep rl. arXiv preprint arXiv:1907.00868,  
2
019.  
Charles Blundell, Benigno Uria, Alexander Pritzel, Yazhe Li, Avraham Ruderman, Joel Z Leibo,  
Jack Rae, Daan Wierstra, and Demis Hassabis. Model-free episodic control. arXiv preprint  
arXiv:1606.04460, 2016.  
Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature  
verification using a" siamese" time delay neural network. In Advances in neural information  
processing systems, pp. 737–744, 1994.  
Yuri Burda, Harri Edwards, Deepak Pathak, Amos Storkey, Trevor Darrell, and Alexei A Efros.  
Large-scale study of curiosity-driven learning. arXiv preprint arXiv:1808.04355, 2018a.  
Yuri Burda, Harrison Edwards, Amos Storkey, and Oleg Klimov. Exploration by random network  
distillation. arXiv preprint arXiv:1810.12894, 2018b.  
Jongwook Choi, Yijie Guo, Marcin Moczulski, Junhyuk Oh, Neal Wu, Mohammad Norouzi,  
and Honglak Lee. Contingency-aware exploration in reinforcement learning. arXiv preprint  
arXiv:1811.01483, 2018.  
Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O Stanley, and Jeff Clune. Go-explore: a new  
approach for hard-exploration problems. arXiv preprint arXiv:1901.10995, 2019.  
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam  
Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with  
importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.  
Vincent François-Lavet, Peter Henderson, Riashat Islam, Marc G Bellemare, Joelle Pineau, et al. An  
introduction to deep reinforcement learning. Foundations and TrendsR in Machine Learning, 11  
(3-4):219–354, 2018.  
Nick Haber, Damian Mrowca, Stephanie Wang, Li F Fei-Fei, and Daniel L Yamins. Learning to play  
with intrinsically-motivated, self-aware agents. In Advances in Neural Information Processing  
Systems, pp. 8388–8399, 2018.  
Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt,  
and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933,  
2
018.  
Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime:  
Variational information maximizing exploration. In Advances in Neural Information Processing  
Systems, pp. 1109–1117, 2016.  
11  
Published as a conference paper at ICLR 2020  
Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David  
Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv  
preprint arXiv:1611.05397, 2016.  
Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali  
Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training  
of neural networks. arXiv preprint arXiv:1711.09846, 2017.  
Steven Kapturowski, Georg Ostrovski, Will Dabney, John Quan, and Remi Munos. Recurrent  
experience replay in distributed reinforcement learning. In International Conference on Learning  
Representations, 2019.  
Hyoungseok Kim, Jaekyeom Kim, Yeonwoo Jeong, Sergey Levine, and Hyun Oh Song. Emi:  
Exploration with mutual information maximizing state and action embeddings. arXiv preprint  
arXiv:1810.01176, 2018.  
Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot  
image recognition. In ICML deep learning workshop, volume 2, 2015.  
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare,  
Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control  
through deep reinforcement learning. Nature, 518(7540):529, 2015.  
Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy  
reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1046–1054,  
2
016.  
Junhyuk Oh, Xiaoxiao Guo, Honglak Lee, Richard L Lewis, and Satinder Singh. Action-conditional  
video prediction using deep networks in atari games. In Advances in neural information processing  
systems, pp. 2863–2871, 2015.  
Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via  
bootstrapped dqn. In Advances In Neural Information Processing Systems, pp. 4026–4034, 2016.  
Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration  
with neural density models. In Proceedings of the 34th International Conference on Machine  
Learning-Volume 70, pp. 2721–2730. JMLR. org, 2017.  
Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration  
by self-supervised prediction. In Proceedings of the IEEE Conference on Computer Vision and  
Pattern Recognition Workshops, pp. 16–17, 2017.  
Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden,  
Gabriel Barth-Maron, Hado Van Hasselt, John Quan, Mel Ve  erík, et al. Observe and look further:  
Achieving consistent performance on atari. arXiv preprint arXiv:1805.11593, 2018.  
Alexander Pritzel, Benigno Uria, Sriram Srinivasan, Adrià Puigdomènech, Oriol Vinyals, Demis  
Hassabis, Daan Wierstra, and Charles Blundell. Neural episodic control. ICML, 2017.  
Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy Lilli-  
crap, and Sylvain Gelly. Episodic curiosity through reachability. arXiv preprint arXiv:1810.02274,  
2
018.  
Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators.  
In International Conference on Machine Learning, pp. 1312–1320, 2015a.  
Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. CoRR,  
abs/1511.05952, 2015b.  
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,  
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering  
the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.  
12  
Published as a conference paper at ICLR 2020  
Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement  
learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.  
Christopher Stanton and Jeff Clune. Deep curiosity search: Intra-life exploration can improve perfor-  
mance on challenging deep reinforcement learning problems. arXiv preprint arXiv:1806.00553,  
2
018.  
Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for  
markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.  
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.  
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Van Hasselt, Marc Lanctot, and Nando De Freitas.  
Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581,  
2
015.  
David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and  
Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv  
preprint arXiv:1811.11359, 2018.  
Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. In  
Advances in neural information processing systems, pp. 2396–2407, 2018.  
13  
Published as a conference paper at ICLR 2020  
A EVALUATION SETUP  
The evaluation we do is also identical to the one done in R2D2 Kapturowski et al. (2019): a parallel  
evaluation worker, which shares weights with actors and learners, runs the Q-network against the  
environment. This worker and all the actor workers are the two types of workers that draw samples  
from the environment. For Atari, we apply the standard DQN pre-processing, as used in R2D2. More  
concretely, this is how actors, evaluators, and learner are run:  
Learner:  
i
Sample from the replay buffer a sequence of augmented rewards rt, intrinsic rewards  
observations x, actions a and discounts γi.  
r
,
t
Use Q-network to learn from (rt, x, a) with retrace using the procedure used by R2D2. As  
i
t
specified in Fig. 1, r is sampled because it is fed as an input to the network.  
Use last 5 frames of the sampled sequences to train the action prediction network as specified  
in Section 2. This means that, for every batch of sequences, all time steps are used to train the  
RL loss, whereas only 5 time steps per sequence are used to optimize the action prediction  
loss.  
(If using RND) also use last 5 frames of the sampled sequences to train the predictor of  
RND as also specified in Section 2.  
Evaluator and Actor  
e
i
t1  
Obtain xt, r , r , and discount γi.  
t
With these inputs, compute forward pass of R2D2 to obtain at.  
i
t
With xt, compute r using the embedding network as described in Section 2.  
e
t
i
t
i
t
(actor) Insert xt, at, rt = r + βir , γi, and r in the replay buffer.  
Step on the environment with at.  
Distributed training  
As in R2D2, we train the agent with a single GPU-based learner, performing approximately 5 network  
updates per second (each update on a mini-batch of 64 length-80 sequences, as explained below, and  
each actor performing  260 environment steps per second on Atari. We assign to each actor a fixed  
and the actor acts according to an -greedy version of this policy. More  
-th actor we assign the value βh with h = j mod N  1. In our experiments, we  
N1  
value in the set {βi}  
i=0  
concretely for the  
j
use the following βi:  
0
β
if i = 0  
if i = N  1  
βi =  
2i(N2)  
N2  
β · σ(10  
)
otherwise  
where  
σ
is the sigmoid function. This choice of βi, as you can see in Fig.7(a), allows to focus more  
on the two extreme cases which are the fully exploitative policy and very exploratory policy.  
In the replay buffer, we store fixed-length sequences of (x, a, r) tuples. In all our experiments we  
collect sequences of length 80 timesteps, where adjacent overlap by 40 time-steps. These sequences  
never cross episode boundaries. Additionally, we store in the replay the value of the βi used by the  
actor as well as the initial recurrent state, that we use to initialize the network at training time. Please  
refer to Kapturowski et al. (2019) for a detailed experimental of trade-offs on different treatments  
of recurrent states in replay. Given a single batch of trajectories we unroll both online and target  
networks on the same sequence of states to generate value estimates. We use prioritized experience  
replay. We followed the same prioritization scheme proposed in Kapturowski et al. (2019) using a  
mixture of max and mean of the TD-errors with priority exponent η = 1.0. In addition, we associate  
for each βi a γi such that:  
(
N  1  i) log(1  γmax) + i log(1  γmin)  
N  1  
γi = 1  exp  
,
(4)  
14  
Published as a conference paper at ICLR 2020  
where γmax is the maximum discount factor and γmin is the minimal discount factor. This form  
allows to have discount factors evenly spaced in log-space between 1  γmax and 1  γmin. For more  
N1  
intuition, we provide a graph of the {γi}  
in Fig.7(b) in App.A. We remark that the exploitative  
i=0  
policy β0 is associated with the highest discount factor γ0 = γmax and the most exploratory policy  
βN1 with the smallest discount factor γ0 = γmin. We can use smaller discount factors for the  
exploratory policies because the intrinsic reward is dense and the range of values is small, whereas  
we would like the highest possible discount factor for the exploitative policy in order to be as close  
as possible to optimizing the undiscounted return. In our experiments, we use γmax = 0.997 and  
γmin = 0.99.  
N1  
i=0  
N1  
i
(b) Values taken by the {γ }  
i=0  
(a) Values taken by the {β  
i
}
Figure 7: Values taken by the {βi} Ni =01 and the {γi} iN= 01 for N = 32 and β = 0.3.  
A.1 NEVER-GIVE-UP INTRINSIC REWARD ALGORITHM  
We present the algorithm for computing the intrinsic reward in Alg. 1. We follow the notations  
defined in Sec. 2 in the paragraph relative to the episodic intrinsic reward:  
M
{
the episodic memory containing at time  
t
the previous embeddings  
f(x0), f(x1), . . . , f(xt1)}.  
k is the number of nearest neighbours.  
k
i=1  
Nk = {fi}  
is the set of k-nearest neighbours of f(xt) in the memory M.  
K
the kernel defined as K(x, y) =  
2
where  
is a small constant,  
d
is the Euclidean  
-nearest  
d
(x,y)  
+
d2  
m
2
m
distance and  
neighbors.  
d
is a running average of the squared Euclidean distance of the  
k
c is the pseudo-counts constant.  
ξ cluster distance.  
sm maximum similarity.  
A.2 COMPLEXITY ANALYSIS  
The space complexity is constant. The number of weights that the network has can be computed  
from the architecture seen in App. F. Furthermore, for our episodic memory buffer, we pre-allocate  
memory at the beginning of training, with size detailed in App. F. In cases in which the episode is  
longer than the size of the memory, the memory acts a ring buffer, deleting oldest entries first.  
Time complexity is O(M · N), where  
N is the number of frames, and M is the size of our memory.  
This is due to the fact that we do one forward pass per frame, and we compute the distance from the  
embeddings produced by the embeddings network to the contents of our memory in order to retrieve  
the k-nearest neighbors.  
15  
Published as a conference paper at ICLR 2020  
episodic  
t
Algorithm 1: Computation of the episodic intrinsic reward at time t: r  
.
2
m
Input :M; k; f(xt); c; ; ξ; sm; d  
episodic  
Output :rt  
1
2
Compute the k-nearest neighbours of f(xt) in M and store them in a list Nk  
Create a list of floats dk of size k  
/
The list dk will contain the distances between the embedding  
f(xt) and its neighbours Nk.  
*
/
*
3
for i ∈ {1, . . . , k} do  
2
4
5
6
dk[i]  d (f(xt), Nk[i])  
end  
2
Update the moving average dm with the list of distances dk  
Normalize the distances dk with the updated moving average d .  
2
m
/
*
/
*
dk  
7
dn ←  
2
d
m
/
Cluster the normalized distances dn i.e. they become 0 if too  
small and 0k is a list of k zeros.  
*
/
/
*
8
9
dn  max(dn  ξ, 0k)  
Compute the Kernel values between the embedding f(xt) and its  
*
/
neighbours Nk.  
*
Kv ←  
dn+ꢀ  
/
Compute the similarity between the embedding f(xt) and its  
neighbours Nk.  
*
/
/
*
*
q
P
k
i=1  
1
1
0
1
s ←  
Kv[i] + c  
i
t
/
Compute the episodic intrinsic reward at time t: r .  
if s > sm then  
*
episodic  
1
1
2
3
r
0  
t
else  
episodic  
1
1
4
r
s  
t
B ABLATIONS FOR NGU(N=1)  
As mentioned in Section 4.2, we here show ablations on the size of the learned controllable states, the  
clipping factor in (1), and the number of nearest neighbours to use for computing pseudo-counts in  
2).  
L
(
Due to the lack of a pure exploitative mode, as seen in 4.2, NGU(N=1) fails to perform well in  
dense reward games. Therefore, in order to obtain high signal from these ablations, we analyze the  
performance of NGU(N=1) on the two most popular sparse reward games: Montezuma’s Revenge  
and Pitfall!.  
B.1 SIZE OF CONTROLLABLE STATES  
In Fig. 8 and Fig. 9 we can see the performance of NGU(N=1) with different sizes of the size of  
the controllable state on Pitfall! and Montezuma’s Revenge respectively. As we can observe, that  
there is small to no impact on Pitfall!, with scores that sometimes reach more than 25,000 points.  
On Montezuma’s Revenge 32 is the value that is consistently better than 64. A size of 16 as the  
controllable state size sometimes solves the level, but is in general less stable.  
B.2 NEAREST NEIGHBORS USED  
We proceed to show a similar analysis on Fig. 10 and Fig. 11 regarding the amount of nearest  
neighbors on Pitfall! and Montezuma’s Revenge respectively. As we can see, there are slight gains  
16  
Published as a conference paper at ICLR 2020  
Figure 8: Mean episodic return for agents  
trained Pitfall!.  
Figure 9: Mean episodic return for agents  
trained Montezuma’s Revenge.  
Figure 10: Mean episodic return for agents  
trained Pitfall!.  
Figure 11: Mean episodic return for agents  
trained Montezuma’s Revenge.  
from using more neighbors on Pitfall!, whereas there is a clear difference in performance in using 10  
neighbors in Montezuma’s Revenge when compared to using 5 or 30 neighbors.  
B.3 CLIPPING FACTOR L  
Finally, we show the performance of NGU(N=1) on Fig. 12 and Fig. 13 regarding the clipping factor  
L
Pitfall! and Montezuma’s Revenge respectively. As we can observe, Pitfall! is again robust to  
the value of this hyperparameter, with marginally worse performance in the case of L = 10. This is  
expected, as RND is generally detrimental to the performance of NGU on Pitfall!, as seen in Section  
4
.2. On the other hand, the highest value of clipping appears to work best on Montezuma’s Revenge.  
In our initial investigations, we observed that clipping this value was required on Montezuma’s  
Revenge to make the algorithm stable. Further analysis is required in order to show the range of  
L that are higher than 10 and are detrimental to the performance of NGU(N=1) on this task.  
values of  
C ABLATIONS FOR NGU(N=32)  
C.1 GENERAL ABLATIONS  
Tab. 3 shows the results for all the ablations we performed on 8 games for NGU(N = 32). We can  
see that the conclusions of Sec. 4.2 hold, with a few additional facts to observe:  
The best score in Montezuma’s Revenge is obtained by using a non-zero Cross Mixture  
Ratio, even though it is relatively close to the score obtained by NGU(N = 32).  
N = 2 and N = 8 have lower average human normalized score on the set of 3 hard  
exploration games when compared to N = 16 or N = 32. Concretely on the set of hard  
Figure 12: Mean episodic return for agents  
trained Pitfall!.  
Figure 13: Mean episodic return for agents  
trained Montezuma’s Revenge.  
17  
Published as a conference paper at ICLR 2020  
Space  
Invaders  
1.6k  
Algorithm  
Pong  
Qbert  
13.4k  
Breakout  
30.5  
Beam Rider  
16.9k  
MR  
Pitfall!  
6.5k  
PrivateEye  
69.6k  
Human  
NGU(N=32)  
N=2  
N=8  
N=16  
CMR=0.5  
β=0.2  
β=0.5  
14.6  
4.8k  
19.6±0.1  
20.6±0.1  
20.0±0.3  
17.2±1.0  
19.0±0.3  
20.3±0.2  
16.8±1.2  
19.7±0.4  
-8.4±1.9  
465.8k±84.9k 532.8±16.5 44.6k±1.2k 68.7k±11.1k 10.4k±1.6k 8.4k±4.5k 100.0k±0.4k  
457.7k±128.0k 576.0±24.9 48.0k±6.2k 71.8k±9.0k 11.1k±1.4k -1.6±0.9 40.6k±0.1k  
481.8k±41.2k 524.1±28.3 43.1k±5.5k 64.0k±3.8k 7.8k±0.3k -1.9±0.4 38.5k±1.6k  
444.2k±61.7k 549.0±11.0 46.4k±3.5k 73.9k±5.3k  
502.9k±73.9k 516.9±20.1 40.3k±7.1k 74.5k±5.9k  
9.5k±0.8k 5.0k±1.2k 52.0k±20.7k  
12.0k±0.8k 5.1k±2.4k 58.8k±16.9k  
350.1k±24.7k 525.0±38.2 43.9k±1.3k 75.6k±10.9k 6.9k±0.1k 3.3k±1.4k 40.6k±0.0k  
480.4k±59.2k 451.6±17.2 40.2k±4.8k 62.3k±5.9k  
550.3k±64.4k 553.7±19.1 47.6k±3.7k 87.1k±9.1k  
11.3k±0.5k 1.3k±0.5k 44.5k±3.2k  
3.0k±0.0k 7.7k±0.9k 40.5k±0.1k  
3.4k±0.7k 600.4±468.9 7.5k±2.2k  
w/o RND  
e
w/o r  
28.1k±1.1k  
383.0±10.0 5.5k±0.3k 6.4k±0.2k  
Table 3: Ablation results on 8 Atari games.  
Algorithm  
Human  
Gravitar  
3.4k  
Solaris  
Venture  
12.3k  
1.2k  
NGU(N=32)  
β=0.2  
β=0.5  
14.1k±0.5k  
14.2k±0.3k  
13.5k±0.4k  
4.9k±0.3k  
6.4k±1.4k  
5.5k±1.4k  
1.7k±0.1k  
2.3k±0.2k  
2.0k±0.1k  
Table 4: Further ablation results on hard exploration games.  
exploration games of Tab. 3, they only achieve super-human performance on Montezuma’s  
Revenge.  
Even though we have seen that the results of β = 0.2 and β = 0.5 have lower average  
on the 3 hard exploration games of Tab. 3, they still individually outperform RND, R2D2,  
R2D2(Retrace), and R2D2+RND on Pitfall! and Private Eye.  
In the case of Private Eye the distance in score might be misleading, as rewards are very  
sparse of large value. For instance, after reaching a score of 40k, if we ignore minor rewards,  
there are only two rewards to be collected of around 30k points. This creates what seems to  
be large differences in scores.  
On Breakout, a high score is achieved without extrinsic reward. This is due to the fact that  
the exploratory policy learns to survive, which eventually leads to a high score.  
C.2 FURTHER ABLATIONS ON HARD EXPLORATION  
On Tab.4 we show further results on the case of β = 0.2 and β = 0.5. We compare them to human  
performance as well as the base NGU(N = 32), with β = 0.3.  
As we can observe, in this case the difference in terms of relative performance among games is less  
pronounced than the ones observed on Tab. 3. In fact, results are slightly better for both values of β  
on all 3 games, with a maximum difference of 1.5k points on Solaris between β = 0.3 and β = 0.2.  
We hypothesize that this is due to the nature of these specific games: the policies learnt on these three  
games seem to focus on exploitation rather than extended exploration of the environment, and in that  
case, similar to what we see for dense reward games in Sec. 4.2, the method shows less variability  
with respect to this hyperparameter.  
D ALGORITHM COMPUTATION COMPARISON  
On Tab. 5 we can see a comparison of the computation used between different algorithms.  
Computation is still difficult to compare even when taking actor steps and parameter updates into  
account: distributed the number of actors in distributed setups will affect how much data the learner  
will be able to consume, but also how off-policy such data is (e.g. in R2D2, if a learner is learning  
from many actors, the data that is sampled from the replay buffer will be more recent than with fewer  
actors).  
E DETAILS ON THE RETRACE ALGORITHM  
Retrace (Munos et al., 2016) is an off-policy Reinforcement Learning algorithm that can be used  
for evaluation or control. In the evaluation setting the goal is mainly to estimate the action-value  
π
function  
Q
of a target policy  
π
from trajectories drawn from a behaviour policy  
µ
. In the control  
18  
Published as a conference paper at ICLR 2020  
Algorithm  
Number of actors Total Number of frames  
R2D2 Kapturowski et al. (2019)  
R2D2(Retrace)  
256  
256  
256  
256  
256  
256  
1
35 B  
35 B  
R2D2 + RND  
35 B  
NGU-RND(N=1)  
35 B  
NGU (N=1)  
35 B  
NGU (N=32)  
35 B  
DQN + PixelCNN Ostrovski et al. (2017)  
DQN + CTS Bellemare et al. (2016)  
RND** Burda et al. (2018b)  
PPO + CoEx Choi et al. (2018)  
150 M  
150 M  
16 B (2 B)  
2 B  
1
1024  
1024  
Table 5: Comparison in number of steps.  
*
*To obtain its best reported result on Montezuma’s Revenge 16B (which we compare against), the  
results table of the work are with 2B.  
setting the target policy, or more precisely the sequence of target policies, depends on the sequence of  
Q-functions that will be generated through the process of approximating Q . To do so, we consider  
trajectories  
τ starting from the state-action couple (x, a) and then following the behaviour policy µ  
of the form:  
τ = (xt, at, rt, xt+1)  
,
(5)  
t  0, rt = r(xt, at) and t  0, xt+1  P(.|xt, at).  
generated by the behaviour policy starting  
tN  
with (x0, a0) = (x, a)  
,
t  1, at  µ(.|xt)  
,
The expectation Eµ is over all admissible trajectories  
τ
µ
in state x doing action a and then following the behaviour policy µ.  
The general Retrace operator T , that depends on µ and π, is:  
!
t
X Y  
t
T Q(x, a) = Q(x, a) + Eµ  
γ
cs δt  
,
(6)  
(7)  
t0  
s=1  
where the temporal difference δt is defined as:  
X
δt = rt + γ  
π(a|xt+1)Q(xt+1, a) Q(xt, at),  
aA  
and the cutting traces coefficients cs as:  
π(as|xs)  
µ(as|xs)  
cs = λ min 1,  
.
(8)  
(9)  
Theorem 2 of Munos et al. (2016) explains in which conditions the sequence of Q-functions:  
Qk+1 = TkQk,  
where Tk depends on the policy-couple (µk, πk) converges to the optimal  
Q
-value  
Q
. In particular  
-greedy with respect to  
one of the conditions is that the sequence of target policies πk is greedy or  
Qk (more details can be found in Munos et al. (2016)).  
t+k  
In practice, at a given time t, we can only consider finite sampled sequences (xs, as, rs, xs+1)s=t  
starting from (xt, at) and then following the behaviour policy µ. Therefore, we define the finite  
sampled-Retrace operator as:  
!
t+k1  
s
X
Y
st  
ˆ
TQ(xt, at) = Q(xt, at) +  
γ
ci δs.  
(10)  
s=t  
i=t+1  
In addition, we use two neural networks. One target network Q(x, a; θ ) and an online network  
Q(x, a; θ). The target network is used to compute the target value yˆ t that the online network will try  
to fit:  
ˆ
yˆ t = TQ(xt, at; θ ),  
(11)  
!
!
t+k1  
s
X
Y
X
st  
=
Q(xt, at; θ ) +  
γ
ci  
rs + γ  
π(a|xs+1)Q(xs+1, a; θ )  Q(xs, as; θ )  
.
s=t  
i=t+1  
aA  
(12)  
19  
Published as a conference paper at ICLR 2020  
In the control scenario the policy chosen π(a|x) is greedy or -greedy with respect to the online  
network Q(x, a; θ). Then, the online network is optimized to minimize the loss:  
2
L(xt, at, θ) = (Q(xt, at; θ)  yˆ t) .  
(13)  
More generally, one can use transformed Retrace operators(Pohlen et al., 2018):  
  
!
t
X Y  
h
1  
t
h
δ
t
  
T Q(x, a) = Eµ  
h
h
(Q(x, a)) +  
γ
cs  
,
(14)  
(15)  
t0  
s=1  
R
h
t
where h  R is a real-function and the temporal difference δ is defined as:  
X
h
π(a|xt+1)h1(Q(xt+1, a)) h1(Q(xt, at)).  
δ = rt + γ  
t
aA  
The role of the function  
h is to reduce (squash) the scale of the action-value function to make it easier  
to approximate for a neural network without changing the optimal property of the operator . In  
T
particular, we use the function h:  
p
z  R, h(z) = sign(z)( |z| + 1  1) + ꢀz,  
(16)  
(17)  
!
!
p
1
+ 4(|z| + 1 + ) 1  
1
z  R, h (z) = sign(z)  
1  
,
2ꢀ  
2
.
with  = 10  
20  
Published as a conference paper at ICLR 2020  
F HYPERPARAMETERS  
F.1 SELECTION OF HYPERPARAMETERS  
In order to select the hyperparameters used for NGU(N = 32) for all 57 Atari games, which are  
3 seeds on the  
set of Atari games shown in Tab. 3. Regarding the hyperparameters concerning the kernel K  
Kernel and the number of neighbors used), we fixed them after determining suitable ranges of  
the intrinsic reward in our initial experimentation on Atari. After running the grid search with those  
hyperparameters, we selected the combination with the highest amount games (out of ) that held a  
shown on Tab. 6, we ran a grid search with the ranges shown on Tab. 9. We used  
8
(
8
score greater than our human benchmark. As one can see on the multiple mixtures ablations seen on  
Tab. 3, as well as the single mixture ablations on App B, the only agent that achieved superhuman  
performance on the set of 8 games is NGU(N = 32).  
Finally, in order to obtain the R2D2+RND baseline, we ran a sweep over the  
β
hyperparameter with  
values 0.1, 0.3, and 0.5, over the 8 games shown in Tab. 3. Coincidentally, like NGU(N = 32), the  
best value of β was determined to be 0.3.  
F.2 COMMON HYPERPARAMETERS  
These are the hyperparameters used in all the experiments. We expose a full list of hyperparameters  
here for completeness. However, as one can see, the R2D2-related architectural hyperparameters are  
identical to the original R2D2 hyperparameters. Shown in Tab. 6.  
Hyperparameter  
Number of Seeds  
Cross Mixture Ratio  
Number of mixtures N  
Optimizer  
Learning rate (R2D2)  
Learning rate (RND and Action prediction)  
Adam epsilon  
Value  
3
0.0  
32  
AdamOptimizer (for all losses)  
0.0001  
0.0005  
0.0001  
0.9  
Adam beta1  
Adam beta2  
0.999  
40  
0.99  
0.997  
64  
Adam clip norm  
i
Discount r  
Discount r  
e
Batch size  
Trace length  
Replay period  
Retrace λ  
R2D2 reward transformation  
Episodic memory capacity  
Embeddings memory mode  
Intrinsic reward scale β  
Kernel ꢀ  
Kernel num. neighbors used  
Kernel cluster distance ξ  
Kernel pseudo-counts constant c  
Kernel maximum similarity s  
Replay priority exponent  
Replay capacity  
80  
40  
0.95  
p
sign(x) · ( |x| + 1  1) + 0.001 · x  
30000  
Ring buffer  
0.3  
0.0001  
10  
0.008  
0.001  
8
m
0.9  
5e6  
Minimum sequences to start replay  
Actor update period  
Target Q-network update period  
Embeddings target update period  
Action prediction network L2 weight  
RND clipping factor L  
Evaluation ꢀ  
6250  
100  
1500  
once/episode  
0.00001  
5
0.01  
0.01  
Target ꢀ  
Table 6: Common hyperparameters.  
21  
Published as a conference paper at ICLR 2020  
F.3 DISCO MAZE HYPERPARAMETERS  
Hyperparameters are shown in Tab. 7.  
Hyperparameter  
Value  
5000  
0.001  
1e6  
0.5  
50  
Episodic memory capacity  
Learning rate (R2D2 and Action prediction)  
Replay capacity  
Intrinsic reward scale β  
Trace length  
Replay period  
Retrace λ  
Retrace loss transformation  
Num. action repeats  
50  
0.97  
identity  
1
Target Q-network update period  
Q-network filter sizes  
Q-network filter strides  
Q-network num. filters  
Action prediction network filter sizes  
Action prediction network filter strides  
Action prediction network num. filters  
Kernel ꢀ  
100  
(3, 3)  
(1, 1)  
(16, 32)  
(3, 3)  
(1, 1)  
(16, 32)  
0.01  
0
Evaluation ꢀ  
Table 7: Disco Maze hyperparameters.  
F.4 ATARI PRE-PROCESSING HYPERPARAMETERS  
Hyperparameters are shown in Tab. 8.  
Hyperparameter  
Value  
30 min  
4
Max episode length  
Num. action repeats  
Num. stacked frames  
Zero discount on life loss  
Random noops range  
Sticky actions  
1
false  
30  
false  
Frames max pooled  
Grayscaled/RGB  
Action set  
3 and 4  
Grayscaled  
Full  
Table 8: Atari pre-processing hyperparameters.  
F.5 HYPERPARAMETER RANGES  
On Tab. 9 we can see the ranges we used to sweep over in our experiments.  
Hyperparameter  
Value  
Intrinsic reward scale β  
Number of mixtures N  
Cross Mixture Ratio  
{0.2, 0.3, 0.5}  
{1, 2, 8, 16, 32}  
{0.0, 0.25, 0.5}  
{1, 3}  
#
Episodes w/o wiping Episodic Memory  
Table 9: Range of hyperparameters sweeps.  
22  
Published as a conference paper at ICLR 2020  
G DETAILED ATARI RESULTS  
Figure 14: R2D2(Retrace) (green), NGU(N=32) with eval β = 0.0 (blue) and eval β = 0.3 (orange).  
23  
Published as a conference paper at ICLR 2020  
Game  
R2D2(Retrace) NGU(32) eval beta=0.0 NGU(32) eval beta=0.3  
alien  
189.1k±15.6k  
338.5k±4.5k  
446.8k±3.9k  
452.2±20.0  
678.8k±1.3k  
2.0k±0.0k  
248.1k±22.4k  
230.5k±4.0k  
344.7k±31.0k  
191.1±1.2  
225.5k±36.9k  
181.6k±2.8k  
336.1k±32.9k  
125.7±5.2  
asteroids  
time pilot  
tutankham  
up n down  
venture  
video pinball  
wizard of wor  
atlantis  
yars revenge  
zaxxon  
bank heist  
battle zone  
beam rider  
berzerk  
620.1k±13.7k  
1.7k±0.1k  
575.2k±10.4k  
779.9±188.7  
596.7k±120.0k  
85.1k±12.3k  
1638.0k±4.6k  
993.8k±1.5k  
190.1k±6.9k  
1.4k±0.0k  
948.6k±5.2k  
120.2k±7.6k  
1654.9k±1.5k  
990.4k±2.9k  
94.8k±23.7k  
15.0k±5.7k  
733.1k±45.2k  
103.9k±1.1k  
69.3k±5.8k  
253.0±1.2  
965.3k±12.8k  
106.2k±7.0k  
1653.6k±2.3k  
986.0k±3.2k  
111.1k±11.8k  
17.4k±12.0k  
691.7k±22.6k  
63.6k±8.6k  
36.2k±4.7k  
211.9±8.4  
571.5k±45.4k  
31.9k±0.6k  
27.3k±0.8k  
161.8±4.9  
bowling  
boxing  
100.0±0.0  
99.7±0.0  
3.8±1.4  
breakout  
centipede  
chopper command  
amidar  
crazy climber  
defender  
demon attack  
double dunk  
enduro  
839.7±6.0  
559.2±29.0  
577.8k±3.1k  
999.9k±0.0k  
17.8k±1.2k  
313.4k±7.7k  
664.1k±1.8k  
143.5k±0.0k  
-14.1±2.4  
387.6±51.0  
574.6k±7.0k  
974.1k±14.7k  
8.5k±0.4k  
700.2k±19.8k  
999.9k±0.0k  
28.2k±0.2k  
294.2k±25.7k  
675.6k±3.2k  
143.9k±0.0k  
24.0±0.0  
357.0k±8.3k  
656.3k±7.2k  
136.6k±3.3k  
-21.5±0.7  
2.4k±0.0k  
2.0k±0.1k  
682.5±24.0  
-29.3±4.1  
fishing derby  
freeway  
86.8±0.9  
32.0±1.9  
33.2±0.1  
28.5±1.1  
21.2±0.5  
frostbite  
10.5k±3.8k  
117.8k±0.9k  
12.9k±0.6k  
36.7k±3.0k  
54.5k±3.0k  
85.1±0.6  
206.4k±150.0k  
113.4k±0.6k  
14.2k±0.5k  
34.8k±5.4k  
69.4k±5.7k  
-4.1±0.3  
67.9k±45.6k  
86.6k±7.4k  
13.2k±0.3k  
1.9k±0.6k  
gopher  
gravitar  
assault  
hero  
64.2k±7.0k  
-0.5±0.2  
ice hockey  
jamesbond  
kangaroo  
krull  
30.8k±2.3k  
14.7k±0.1k  
131.1k±12.7k  
220.7k±3.9k  
2.3k±0.4k  
26.6k±2.6k  
35.1k±2.1k  
127.4k±17.1k  
212.1k±11.2k  
10.4k±1.5k  
40.8k±1.1k  
23.9k±0.5k  
959.1k±2.7k  
950.7k±23.1k  
7.8k±4.0k  
22.4k±0.3k  
16.9k±2.0k  
26.7k±2.2k  
203.2k±10.8k  
16.8k±6.8k  
38.1k±1.6k  
15.6k±0.3k  
933.3k±4.7k  
953.1k±21.7k  
1.3k±0.9k  
kung fu master  
montezuma revenge  
ms pacman  
name this game  
phoenix  
40.3k±1.1k  
70.6k±8.3k  
935.2k±13.5k  
994.3k±0.9k  
-5.2±1.7  
asterix  
pitfall  
pong  
20.9±0.0  
19.6±0.2  
-9.7±1.4  
private eye  
qbert  
31.5k±5.4k  
398.6k±46.4k  
38.9k±0.6k  
105.2k±16.8k  
142.4±0.8  
100.0k±0.4k  
451.9k±82.1k  
36.7k±2.3k  
128.6k±28.7k  
9.1±0.7  
65.6k±14.3k  
449.0k±108.2k  
42.6k±4.0k  
103.9k±7.1k  
24.9±2.5  
riverraid  
road runner  
robotank  
seaquest  
1000.0k±0.0k  
-11081.7±97.7  
5.6k±1.1k  
1000.0k±0.0k  
-22977.9±5059.9  
4.7k±0.3k  
1000.0k±0.0k  
-21907.4±3647.5  
1.3k±0.1k  
skiing  
solaris  
space invaders  
star gunner  
surround  
tennis  
33.4k±12.2k  
412.8k±4.1k  
9.8±0.0  
43.4k±1.3k  
414.6k±66.8k  
-9.6±0.2  
12.0k±1.9k  
452.5k±53.0k  
-7.3±0.4  
24.0±0.0  
10.2±3.5  
-3.5±5.7  
Table 10: Scores over all 57 Atari games.  
24  
Published as a conference paper at ICLR 2020  
H NETWORK ARCHITECTURES  
H.1 ARCHITECTURE OF THE EMBEDDING NETWORK WITH INVERSE DYNAMICS PREDICTION  
Softmax  
Fully Connected (Units: 18)  
RELU  
Fully Connected (Units:  
128)  
RELU  
RELU  
Fully Connected (Units: 32)  
Fully Connected (Units: 32)  
RELU  
RELU  
Conv (Kernel:3x3  
Conv (Kernel:3x3  
Stride:1x1 Channels:64)  
Stride:1x1 Channels:64)  
RELU  
RELU  
Conv (Kernel:4x4 Stride:2x2  
Channels:64)  
Conv (Kernel:4x4 Stride:2x2  
Channels:64)  
RELU  
RELU  
Conv (Kernel:8x8 Stride:4x4  
Channels:32)  
Conv (Kernel:8x8 Stride:4x4  
Channels:32)  
Figure 15: Embedding Network Architecture.  
H.2 ARCHITECTURE OF THE RANDOM NETWORK DISTILLATION  
Fully Connected (Units:  
28)  
Fully Connected (Units:  
128)  
1
RELU  
RELU  
Conv (Kernel:3x3  
Conv (Kernel:3x3  
Stride:1x1 Channels:64)  
Stride:1x1 Channels:64)  
RELU  
RELU  
Conv (Kernel:4x4 Stride:2x2  
Channels:64)  
Conv (Kernel:4x4 Stride:2x2  
Channels:64)  
RELU  
RELU  
Conv (Kernel:8x8 Stride:4x4  
Channels:32)  
Conv (Kernel:8x8 Stride:4x4  
Channels:32)  
Figure 16: RND Network Architecture.  
25  
Published as a conference paper at ICLR 2020  
-
Mean  
+
Fully Connected (Units:1)  
Fully Connected (Units:18)  
RELU  
RELU  
Fully Connected  
Fully Connected  
(Units:512)  
(Units:512)  
RL Head  
LSTM Core  
(Units: 512)  
RELU  
Fully Connected (Units:  
12)  
5
LSTM Core  
RELU  
Conv (Kernel:3x3  
Stride:1x1 Channels:64)  
RELU  
Conv (Kernel:4x4 Stride:2x2  
Channels:64)  
Convolutional  
Torso  
RELU  
Conv (Kernel:8x8 Stride:4x4  
Channels:32)  
(a) Sketch of the R2D2 Agent (b) Detailed R2D2 Agent  
Figure 17: R2D2 Agent Architecture.  
H.3 ARCHITECTURE OF THE R2D2 AGENT  
I CONTROLLABLE STATES  
In this section we evaluate properties of the learned controllable states. We further present a study of  
the performance of the algorithm when having access to oracle controllable states containing only the  
necessary information. We use Montezuma’s Revenge as a case-study.  
I.1 INSPECTING THE PROPERTIES OF LEARNED CONTROLLABLE STATES  
As explained in Section 2, we train the embedding network  
f using an inverse dynamics model  
as done by Pathak et al. (2017). Intuitively, the controllable states should contain the information  
relevant to the action performed by the agent given two consecutive observations. However it might  
contain other type of information as long as it can be easily ignored by our simple classifier, g.  
As noted in Burda et al. (2018b), for this game, one can identify a novel state by using five pieces of  
information: the (x, y) position of the player, a room identifier, the level number, and the number of  
keys held. This information can be easily extracted from the RAM state of the game as described in  
Section I.3 bellow. One question that we could ask is whether this information is present (or easily  
decodable) or not in the learned controllable state. We attempted to answer this question by training a  
linear classifier to predict the (x, y) coordinates and the room identified from the learned controllable  
state. Importantly we do not backpropagate the errors to the embedding network f. Figure 19 shows  
the average results over the episodes as the training of the agent progresses. We can see that the  
squared error in predicting the (x, y) position of the agent stabilises to a more or less constant value,  
which suggests that it can successfully generalise to new rooms (we do not observe an increase in  
the error when new rooms are discovered). The magnitude of the error is of the order of 12 units,  
which less than 10% of the range (see Section I.3). This is to be expected, as it is the most important  
information for predicting which action was taken. It shows that the information is quite accessible  
and probably has a significant influence in the proposed novelty measure. The room identifier, on the  
other hand, is information that is not necesary to predict the action taken by the agent. Unlike the  
previous case, one can see jumps in the error as training progresses as the problem becomes harder.  
26  
Published as a conference paper at ICLR 2020  
Figure 18: Prediction loss from learned controllable states of the position (top) and room (bottom) on  
Montezuma’s Revenge against minibatch updates.  
It stabilises around an error slightly above 20%, which is reasonably good considering that random  
chance is 96%. This means that even if there is nothing specifically encouraging this information to  
be there, it is still present and in turn can influence the proposed novelty signal.  
An avenue of future work is to research alternative methods for learning controllable states that  
directly search for retaining all relevant information. While very good results can be obtained with  
one of the simple alternative of an inverse dynamics model, it is reasonable to think that better results  
could be attained when using a better crafted one. To inform this question, we investigate in the next  
section what results could we obtain if we explicitly use as controllable states the quantities that we  
were trying to predict in this section.  
I.2 MONTEZUMAS REVENGE WITH HAND-CRAFTED CONTROLLABLE STATES  
In the previous section we analysed the properties of the learned controllable states. A valid question  
to ask is: how would the NGU work if we had access to an oracle controllable state containing only  
the relevant information? This analysis is a form of upper bound performance for a given agent  
architecture. We ran the NGU(N=1) model with two ablations: without RND and without extrinsic  
rewards. Instead of resetting the memory after every episode, we do it after a small number of  
consecutive episodes, which we call a meta-episode. This structure plays an important role when the  
agent faces irreversible choices. In this setting, approaches using non-episodic exploration bonuses  
are even more susceptible to suffer from the “detachment” problem described in Ecoffet et al. (2019).  
The agent might switch between alternatives without having exhausted all learning opportunities,  
rendering choosing the initial option uninteresting from a novelty perspective. The episodic approach  
with a meta-episode of length one would be forced to make similar choices. However, when run with  
multiple episodes it can offer an interesting alternative. In the first episode, the agent starts with an  
empty episodic memory can can choose arbitrarily one of the options. In the second episode, the  
episodic memory contains all the experience collected in the first episode. The agent is then rewarded  
for not repeating the strategy followed in the first one, as revisiting those states will lead to lower  
intrinsic reward. Thus, the agent is encourage to learn diverse behaviour across episodes without  
needing to choose between alternatives nor being susceptible to the detachment problem. Results are  
summarized in Fig. 19. We report the average episodic return (left) as well as the average number of  
visited rooms per meta-episode (right). The model achieves higher scores than the one using learned  
controllable states (as reported in Section 4.2).  
Incorporating long-term novelty in the exploration bonus, encourages the agent to concentrate in  
the less explored areas of the environment. Similarly to what we observed with learned controllable  
states, this provides a boost both in data efficiency as well as final performance, obtaining close to  
1
5,000 average return and visiting an average of 25 rooms per episode. In this run, three out of five  
seeds reach the second level of the game, one of which reaches the third level with an average of  
fifty different rooms per episode. We also observe that, when running in the absence of extrinsic  
rewards, the agent remarkably still achieves a very high extrinsic reward. Secondly, the agent is able  
27  
Published as a conference paper at ICLR 2020  
Figure 19: Mean episodic return (left) and mean number of visited rooms per episode (right) vs  
environment frames for agents trained Montezuma’s Revenge with hand-crafted controllable states.  
to consistently reach a large number of rooms and explore more than 20 rooms without any extrinsic  
guidance.  
As noted in Burda et al. (2018b), in Montezuma’s Revenge each level contains 6 doors and 4 keys.  
If the agent walks through a door holding a key, it receives a reward of 300 consuming the key in  
the process. In order to clear a level, the agent needs open two doors located just before the final  
room. During exploration, the agent needs to hold on to two keys to see what it could do with  
them later in the episode, sacrificing the immediate reward of opening more accessible doors. Any  
agent that acts almost greedily will struggle with what looks like a high level choice. With the  
right representation and using meta-episodes, our method can handle this problem in an interesting  
way. When the number of keys held is represented in the controllable state, the agent chooses a  
different key-door combination on each of the three episodes in which we do not wipe our episodic  
memory. At the end of training, in the first episode after wiping the episodic memory, our agent  
shows a score of 14, 660 ± 196, while the third episode the agent shows a score of 34, 040 ± 9, 835  
,
3
exploring on average over 30 rooms and consistently going to the second level . The agent learns  
a complex exploratory policy spanning several episodes that can handle irreversible choices and  
overcome “distractor” rewards. We do not observe different key-door combinations across episodes  
when using learned controllable states. Presumably the signal of the number of held keys in the  
learned controllable states is not strong enough to treat them as sufficiently different.  
The results describe in this section support the idea that significant gains can be obtained by im-  
proving the respresentation of the controllable states, suggesting that the study of learning better  
representations is an interesting line for future work. Recent works have explored ways of measuring  
novelty by learning controllable aspects of an environment (Kim et al., 2018; Warde-Farley et al.,  
2
018), and we believe that some of these ideas could be also useful in this setting.  
I.3 HAND-CRAFTED STATE FEATURES FOR MONTEZUMAS REVENGE  
We obtain the hand-crafted features for Montezuma’s Revenge by observing the RAM state of the  
game at every time step. More concretely:  
x and y can be observed at positions 0xAA and 0xAB respectively, represented by integers  
with a range of [0, 153] × [0, 122].  
Room id and level number can be found in positions 0x83 and 0xB9 respectively. We  
provide this information as a single integer to our agent in the form of rid + 24  ln where  
rid  {0, . . . , 23} is the room id, and ln is the level number.  
Byte 0xC1 is the player’s inventory. We count the number of keys being held (and provide  
this information to the agent) by adding the bits {2, . . . , 6}, which correspond to the binary  
slots for keys.  
3See video of the three episodes at https://sites.google.com/view/nguiclr2020  
28  
仅选取文字
重要
更重要
最重要
关键内容
同意
不同意
您已达到评论上限
基本用户只能在每文件上写3条评论。
升级至LINER Premium,并撰写无限数量的评论。
升级

Premium版本提供的功能

观看广告,解锁所有Premium功能。

立即观看